ABSTRACT

In this paper we study the tradeoff between parallelism and
communication cost in a map-reduce computation. For any
problem that is not "embarrassingly parallel," the finer we
partition the work of the reducers so that more parallelism
can be extracted, the greater will be the total communication
between mappers and reducers. We introduce a model
of problems that can be solved in a single round of map-reduce
computation. This model enables a generic recipe
for discovering lower bounds on communication cost as a
function of the maximum number of inputs that can be assigned
to one reducer. We use the model to analyze the
tradeoff for three problems: finding pairs of strings at Hamming
distance d, finding triangles and other patterns in a
larger graph, and matrix multiplication. For finding strings
at Hamming distance 1, we have upper and lower bounds
that match exactly. For triangles and many other graphs,
we have upper and lower bounds that are the same to within
a constant factor. For the problem of matrix multiplication,
we have matching upper and lower bounds for one-round
map-reduce algorithms. We are also able to explore two-round
map-reduce algorithms for matrix multiplication and
show that these never have more communication, for a given
reducer size, than the best one-round algorithm, and often
have significantly less.



1. INTRODUCTION 

We assume the reader is familiar with map-reduce [8] and
its open-source implementation Hadoop [24]. A brief summary
can be found in Chapter 2 of [19]. There have been
many custom solutions using a single round of map-reduce
for specific problems, e.g., performing fuzzy joins [3, 23],
clustering [7], graph analyses [2, 21], multiway join, and
so on. Here, we develop techniques for analyzing problems of
this type and optimizing the performance on any distributed
computing environment by explicitly studying an inherent
tradeoff between communication cost and parallelism.

1.1 Communication and Parallelism for Map- 
Reduce 

This paper offers a model that helps us analyze how suited 
problems are to a map-reduce solution. We focus on two 
parameters that represent the tradeoff involved in designing 
map-reduce algorithms. 

First is the amount of communication between the map
phase and the reduce phase. Often, but not always, the
cost of communication is the dominant cost of a map-reduce
algorithm. To represent the communication cost, we define
and study replication rate. The replication rate of any
map-reduce algorithm is the average number of key-value pairs
that the mappers create from each input.

The second parameter is the "reducer size." A reducer, 
in the sense we use the term in this paper, is a reduce- 
key (one of the keys that can appear in the output of the 
mappers) together with its list of associated values, as would 
be delivered to a reduce-worker. Reducer size is the upper 
bound on how long the list of values can be. For example, 
we may want to limit a reducer to no more input than can 
be processed in main memory. A reduce-worker may be 
assigned many reduce-keys and works on them one at a time. 
The total computation cost of the reducers is the sum over
all keys (or "reducers") of the computation cost of processing
all the values associated with that key.^1

Limiting reducer size also enables more parallelism. Small 
reducer sizes force us to redesign the notion of a "key" in 
order to allow more, smaller reducers, and thus allow more 
parallelism if enough compute nodes are available. 

1.2 How the Tradeoff Can Be Used 

Suppose we have determined that the best algorithms for
a problem have replication rate r and reducer size q, where
r = f(q) for some function f. Look ahead to Fig. 1 for an
example of what such a function f might look like. In particular,
be aware that f(q) usually grows as q shrinks. When
we try to solve an instance of this problem on a particular
cluster, we must determine the true costs of execution. For
example, if we are running on EC2 [5], we pay particular
amounts for communication and for rental of virtual processors.
The communication cost is proportional to r; the constant
of proportionality depends on the rate EC2 charges for
communication and the size of our data. The cost of renting
processors is some function of q.

Example 1.1. If the reducer must compare all pairs of its
inputs (e.g., consider the Hamming-distance-based similarity
join discussed later in Example 2.3), then the work at each
reducer is O(q^2), and the number of reducers is inversely
proportional to q, so the total processor cost is proportional
to q. That is, the cost of solving this instance of our problem
is ar + bq for some constants a and b. Since r = f(q), the
cost is af(q) + bq. We find the value of q that minimizes this
expression. That value tells us which of the algorithms lying
along the curve r = f(q) should be selected for this job.^2

1 Computation cost at the mappers is not treated separately, 
but is incorporated into the communication cost. 
2 Note that typically, f(q) is monotonically decreasing in q,
so there is a minimum at some finite value of q.
If we were concerned more with wall-clock time than with
total computation cost, then we might add a term representing
the execution time for a single reducer. In this hypothetical
example, the time to compare the (q choose 2) pairs is
O(q^2), so we would minimize a function of the form
af(q) + bq + cq^2.

Different problems will have different functions r = f(q), 
and they will also have different functions of q that measure 
the computation cost. This function need not be one of the linear
or quadratic functions suggested in Example 1.1. However:

• Deducing the proper function of q to represent the 
computation cost is not harder than analyzing, the- 
oretically or experimentally, the running time of the 
serial algorithm that implements the reduce-function. 
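The selection of q described in Example 1.1 can be sketched numerically. The following is a minimal illustration, not a procedure given in the paper: the cost constants a = 1000 and b = 1, the string length of 20 bits, and the function names are all assumptions chosen for the example, and the tradeoff curve f(q) = 20 / log_2 q is borrowed from the Hamming-distance-1 bound of Section 3.

```python
from math import log2

def total_cost(q, f, a, b):
    """Cost a*f(q) + b*q from Example 1.1; a and b are the cost constants."""
    return a * f(q) + b * q

def best_reducer_size(candidates, f, a, b):
    """Return the candidate reducer size q with the smallest total cost."""
    return min(candidates, key=lambda q: total_cost(q, f, a, b))

BITS = 20  # assumed string length, for illustration only

def hamming_f(q):
    # Tradeoff curve r = f(q) = BITS / log2(q) (Hamming distance 1, Section 3)
    return BITS / log2(q)

# Reducer sizes 2^(BITS/c) for the divisors c of BITS
candidates = [2 ** (BITS // c) for c in (1, 2, 4, 5, 10, 20)]
best = best_reducer_size(candidates, hamming_f, a=1000.0, b=1.0)
print(best, total_cost(best, hamming_f, 1000.0, 1.0))
```

With these assumed constants the communication term dominates for small q and the processor term dominates for large q, so the minimum falls at an intermediate reducer size.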

1.3 Outline of the Paper 

There may be many ways to solve nontrivial problems in a 
single round of map-reduce. The more parallelism you want, 
the more communication overhead you face due to having 
to replicate inputs to many reducers. In this paper: 

• We offer a simple model of how inputs and outputs are 
related. We show how our model can capture a varied 
set of problems (Section 2).

• We define the fundamental tradeoff between 

a) Reducer size: the maximum number of inputs 
that one reducer can receive, and 

b) Replication rate, or average number of key-value 
pairs to which each input is mapped by the map- 
pers. 

• We study three well-known problems: Hamming distance
(Section 3), triangle finding (Section 4) and some
generalizations (Section 5), and matrix multiplication
(Section 6). In each case there is a lower bound on
the replication rate that grows as reducer size shrinks
(and therefore as the parallelism grows). Moreover, we
present algorithms that match these lower bounds for
various reducer sizes.

1.4 Related Work 

In [TT], the optimization of theta-join implementation by 
map-reduce was considered from a point of view similar to 
what we propose here. That paper considers only one special
case of our model, where each output depends on only two
inputs, and it does not deal with the matter of tradeoff between
reducer size and communication. An inherent tradeoff
between communication cost and parallelism has been
studied in different contexts, e.g., pipelined parallelism [11];
we study this tradeoff for single-round map-reduce jobs.

The model of [13] proposes that a map-reduce algorithm 
should limit the input size of any reducer to be asymptot- 
ically smaller than the total amount of input. This idea is 
appropriate for eliminating trivial algorithms that really do 
all the work serially in one reducer and thus limits consid- 
eration to algorithms that we might think of as truly par- 
allel. However, it does not let us get into the matter of 
size/communication tradeoffs. 

Map-reduce differs from previous parallel-computation mod- 
els (e.g., PRAM) in that it interleaves sequential and parallel 
computation. Thus the essential constraint on map-reduce
comes not so much from the demand for parallelism, but
from the limit on how much input we can expect a reducer
to handle and how costly communication among processors
is. For instance, if the input is small enough, then the optimal
choice is to run everything at one compute node, thus
minimizing communication, regardless of the asymptotics of
the algorithm.

There has been a lot of interest in handling skewed data in
map-reduce (e.g., [15]). The work closest to our setting
is [14], where the authors propose a slight modification to
the map-reduce computational framework to allow for a small
amount of communication among the mappers in order to
decide how to handle skewed data. Handling skewed data is
not the focus of our paper, but the need to deal with skewed
data, e.g., graphs with some nodes whose degree is higher
than the limit q on reducer size, will require alternative
algorithms.

Our model for describing problems is closely related to the 
notion of data provenance [22]. There has also been some 
work [12, 18] on provenance in the context of distributed
workflows, including map-reduce workflows. 

2. THE MODEL 

The model is simple yet powerful: We can develop some 
quite interesting and realistic insights into the range of possi- 
ble map-reduce algorithms for a problem. For our purposes, 
a problem consists of: 

1. Sets of inputs and outputs. 

2. A mapping from outputs to sets of inputs. The intent 
is that each output depends on only the set of inputs 
it is mapped to. 

There are two non-obvious points about this model: 

• Inputs and outputs are hypothetical, in the sense that 
they are all the possible inputs or outputs that might 
be present in an instance of the problem. Any instance 
of the problem will have a subset of the inputs. We 
assume that an output is never made unless at least 
one of its inputs is present, and in many problems, we 
only want to make the output if all of its associated 
inputs are present. 

• We need to limit ourselves to finite sets of inputs and 
outputs. Thus, a finite domain or domains from which 
inputs are constructed is essential, and a "problem" is 
really a family of problems, one for each choice of fi- 
nite domain(s). We also require that there be a finite 
set of outputs associated with each choice of input do- 
main(s). The values that these outputs can take may 
be a function of the inputs on which each output de- 
pends, and we do not need to specify the domain for 
the output in advance. Example 2.4 illustrates how
the outputs can compute a function of their associated 
inputs. 

2.1 Examples of Problems 

In this section we offer several examples of common map- 
reduce problems and how they are modeled. 

Example 2.1. Natural join of two relations R(A, B) and
S(B, C). The inputs are tuples in R or S, and the outputs
are tuples with schema (A, B, C). To make this problem finite,
we need to assume finite domains for attributes A, B,
and C; say there are N_A, N_B, and N_C members of these
domains, respectively.

Then there are N_A N_B N_C outputs, each corresponding to a
triple (a, b, c). Each output is mapped to a set of two inputs.
One is the tuple R(a, b) from relation R and the other is
the tuple S(b, c) from relation S. The number of inputs is
N_A N_B + N_B N_C.

Notice that in an instance of the join problem, not all the
inputs will be present. That is, the relations R and S will be
subsets of all the possible tuples, and the output will be those
triples (a, b, c) such that both R(a, b) and S(b, c) are actually
present in the input instance.
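The join model of Example 2.1 can be written down directly. The sketch below is only an illustration under assumed names (`join_problem`, `instance_outputs`, and the tagging of tuples with "R" or "S" are hypothetical conventions, not notation from the paper); it builds the hypothetical inputs and outputs for small domains and then evaluates one instance.

```python
from itertools import product

def join_problem(dom_a, dom_b, dom_c):
    """Build all potential inputs and the output-to-inputs mapping."""
    inputs = {("R", a, b) for a, b in product(dom_a, dom_b)}
    inputs |= {("S", b, c) for b, c in product(dom_b, dom_c)}
    # Each output (a, b, c) depends on exactly two inputs: R(a, b) and S(b, c)
    outputs = {
        (a, b, c): {("R", a, b), ("S", b, c)}
        for a, b, c in product(dom_a, dom_b, dom_c)
    }
    return inputs, outputs

def instance_outputs(outputs, present):
    """Outputs produced by an instance: those whose inputs are all present."""
    return {out for out, needed in outputs.items() if needed <= present}

inputs, outputs = join_problem(range(2), range(3), range(4))
present = {("R", 0, 1), ("S", 1, 2), ("S", 0, 0)}
print(len(inputs), len(outputs), instance_outputs(outputs, present))
```

For domain sizes 2, 3, and 4 this gives 2·3 + 3·4 = 18 potential inputs and 2·3·4 = 24 potential outputs, matching the counts in the example.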

Example 2.2. Finding triangles. We are given a graph 
as input and want to find all triples of nodes such that in 
the graph there are edges between each pair of these three 
nodes. To model this problem, we need to assume a domain 
of size N for the nodes of the input graph. An output is thus 
a set of three nodes, and an input is a set of two nodes. The 
output {u, v, w} is mapped to the set of three inputs {u, v},
{u, w}, and {v, w}. Notice that, unlike the previous and next
examples, here an output is mapped to a set of more than two inputs.
In an instance of the triangles problem, some of the possible 
edges will be present, and the outputs produced will be those 
such that all three edges to which the output is mapped are 
present. 

Example 2.3. Hamming distance 1. The inputs are binary
strings, and since domains must be finite, we shall assume
that these strings have a fixed length b. There are thus
2^b inputs. The outputs are pairs of inputs that are at Hamming
distance 1; that is, the inputs differ in exactly one bit.
Hence there are (b/2)2^b outputs, since each of the 2^b inputs
is at Hamming distance 1 from exactly b other inputs - those
that differ in exactly one of the b bits. However, that observation
counts every pair of inputs at distance 1 twice, which
is why we must divide by 2.
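The count of outputs in Example 2.3 is easy to confirm by enumeration. The following sketch represents bit strings as integers, which is an implementation choice made here for brevity; the function name is hypothetical.

```python
def hamming1_outputs(b):
    """All unordered pairs of b-bit strings (as ints) at Hamming distance 1."""
    pairs = set()
    for w in range(2 ** b):
        for bit in range(b):
            v = w ^ (1 << bit)  # flip exactly one bit of w
            pairs.add((min(w, v), max(w, v)))  # store each pair once
    return pairs

for b in range(1, 7):
    print(b, len(hamming1_outputs(b)), b * 2 ** b // 2)
```

Storing each pair in sorted order is what removes the double counting that the example corrects for by dividing by 2.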

Example 2.4. Grouping and aggregation. This example 
illustrates how to deal with a problem where the outputs are 
more than "yes" or "no" responses to whether a given set 
of inputs exists. Here, each output depends on a large set 
of possible inputs, and the result of an output is calculated 
from those of its associated inputs that actually appear in the 
data set. Suppose we have a relation R(A, B) and we want 
to implement group-by-and-sum: 

SELECT A, SUM(B) 
FROM R 
GROUP BY A; 

We must assume finite domains for A and B. An output 
is a value of A, say a, chosen from the finite domain of 
A-values, together with the sum of all the B-values. This 
output is associated with a large set of inputs: all tuples with 
A-value a and any B-value from the finite domain of B. In 
any instance of this problem, we do not expect that all these 
tuples will be present for a given A-value, a, but (unlike the 
previous examples) as long as at least one of them is present 
there will be an output for this value a. 



2.2 Mapping Schemas 

In our discussion, we shall use the convention that q is 
the maximum number of inputs that can be sent to any one 
reducer. 

A mapping schema for a given problem, with a given value 
of q, is an assignment of a set of inputs to each reducer, 
subject to the constraints that: 

1. No reducer is assigned more than q inputs. 

2. For every output, there is (at least) one reducer that is 
assigned all of the inputs for that output. We say such 
a reducer covers the output. This reducer need not be 
unique, and it is, of course, permitted that these same 
inputs are assigned also to other reducers. 

The figure of merit for a mapping schema with a given
reducer size q is the replication rate, which we defined to be
the average number of reducers to which an input is mapped
by that schema. Suppose that for a certain algorithm there are
p reducers, the ith reducer is assigned q_i ≤ q inputs, and
|I| is the number of different inputs. Then the replication
rate r for this algorithm is

\[ r = \sum_{i=1}^{p} q_i / |I| \]

Example 2.5. To see one subtlety of the model, consider 
the canonical example of a map-reduce algorithm: word- 
count. In the standard formulation, inputs are documents, 
and the outputs are pairs consisting of a word w and a count 
of the number of times w appears among all the documents. 
The standard algorithm works as follows. The map function 
takes a document, breaks it into words, and for each word 
w, it generates a key-value pair (w, 1). There is one reducer 
for each key (i.e., for each word), and the reduce function
sums the 1's in the list of values it is given for a word and
thus computes the count for that word. 

It looks like there is a great deal of replication, because each 
input results in as many key-value pairs as there are words. 
However, this view is deceptive. We could just as well have 
thought of the inputs as the word occurrences themselves, 
and then each word occurrence results in exactly one key- 
value pair. That is, the replication rate is 1, independent 
of the limit q on reducer size.^3 Since the replication rate is
identically 1, there is no tradeoff at all between q and repli- 
cation rate; i.e., the word-count problem is embarrassingly 
parallel, as we knew all along. 

We want to derive upper and lower bounds on the min- 
imum possible r, as a function of q, for various problems, 
thus demonstrating the tradeoff between high parallelism 
(many small reducers, so q is small) and low overhead (total 
communication cost - measured by the replication rate). 
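The two constraints on a mapping schema and the definition of replication rate from this section translate into a few lines of code. This is a sketch under assumed representations: a schema is taken to be a list of sets of inputs (one set per reducer) and each output is given as the set of inputs it depends on; the function names are hypothetical.

```python
def is_mapping_schema(schema, outputs, q):
    """Check the two constraints: reducer size at most q, every output covered."""
    if any(len(reducer) > q for reducer in schema):
        return False
    # An output is covered if some reducer holds all of its inputs
    return all(any(needed <= reducer for reducer in schema) for needed in outputs)

def replication_rate(schema, num_inputs):
    """Average number of reducers to which an input is assigned."""
    return sum(len(reducer) for reducer in schema) / num_inputs

# Hamming distance 1 on 2-bit strings: inputs 0..3, four outputs
outputs = [{0, 1}, {0, 2}, {1, 3}, {2, 3}]
one_per_output = [set(o) for o in outputs]  # q = 2
single_reducer = [{0, 1, 2, 3}]             # q = 4

print(is_mapping_schema(one_per_output, outputs, 2), replication_rate(one_per_output, 4))
print(is_mapping_schema(single_reducer, outputs, 4), replication_rate(single_reducer, 4))
```

The two schemas sit at opposite ends of the tradeoff: reducers of size 2 force replication rate 2, while one reducer of size 4 achieves replication rate 1.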

2.3 Independence of Inputs in the Mappers 

When we calculate bounds on the replication rate we pretend
that we have an instance of the problem where all inputs
over the given domain are present. This actually captures
the nature of map-reduce computation. Normally, in a
mapper, a map function turns input objects into key-value
pairs independently, without knowing what else is in the
input. Thus, we can assume that the mapping schema assigns
inputs to processors without reference to what inputs are
actually present. Consequently, the replication rate r we
calculate represents the expected communication if we multiply
it by the number of inputs actually present, so r is a good
measure of the communication cost incurred by any instance
of the problem.

3 Technically, if q is smaller than the number of occurrences
of a particular word, then this algorithm will not work at
all. But there is little reason to choose a q that small.

Further to this point, recall that q counts the number of 
potential inputs in a reducer, regardless of which inputs are 
actually present for an instance of the problem. However, on 
the assumption that inputs are chosen independently with 
fixed probability, we can expect the number of actual inputs 
at a reducer to be q times that probability, and there is a 
vanishingly small chance of significant deviation for large q. 
If we know the probability of an input being present in the
data is x, and we can tolerate q_1 real inputs at a reducer,
then we can use q = q_1/x to account for the fact that not
all inputs will actually be present.
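As a worked illustration of this adjustment, take values chosen purely for the example (x = 0.01 and q_1 = 10^6 are assumptions, not figures from the paper), and treat the number of inputs actually present at a reducer as binomially distributed, which is what the independence assumption above implies:

```latex
\begin{align*}
q &= \frac{q_1}{x} = \frac{10^{6}}{0.01} = 10^{8}, \\
\mathbb{E}[\text{inputs present}] &= q\,x = 10^{6}, \\
\sigma &= \sqrt{q\,x\,(1-x)} = \sqrt{9.9 \times 10^{5}} \approx 995 .
\end{align*}
```

Under these assumed numbers, exceeding the expected load by even 1% (10^4 inputs) would be a deviation of roughly ten standard deviations, which illustrates why significant deviation is unlikely for large q.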

2.4 The Recipe for Lower Bounds 

While upper bounds on r for all problems are derived 
using constructive algorithms, there is a generic technique 
for deriving lower bounds. Before proceeding to concrete 
lower bounds, we outline in this section the recipe that we 
use to derive all the lower bounds used in this paper. 

1. Deriving g(q): First, find an upper bound, g(q), on 
the number of outputs a reducer can cover if q is the 
number of inputs it is given. 

2. Number of Inputs and Outputs: Count the total
numbers of inputs |I| and outputs |O|.

3. The Inequality: Assume there are p reducers, each
receiving q_i ≤ q inputs and covering at most g(q_i) outputs.
Together they cover all the outputs. That is:

\[ \sum_{i=1}^{p} g(q_i) \ge |O| \tag{1} \]

4. Replication Rate: Manipulate the inequality from
Equation 1 to get a lower bound on the replication
rate, which is $\sum_{i=1}^{p} q_i / |I|$.

Note that the last step above may require clever manipulation
to factor out the replication rate. We have noticed
that the following "trick" is effective in Step (4) for all problems
considered in this paper. First, arrange to isolate a
single factor q_i from g(q_i); that is:

\[ \sum_{i=1}^{p} g(q_i) \ge |O| \;\Rightarrow\; \sum_{i=1}^{p} q_i \frac{g(q_i)}{q_i} \ge |O| \tag{2} \]

Assuming g(q_i)/q_i is monotonically increasing in q_i, we can use
the fact that q_i ≤ q for all i to obtain from Equation 2:

\[ \sum_{i=1}^{p} q_i \frac{g(q)}{q} \ge |O| \tag{3} \]

Now, divide both sides of Equation 3 by g(q)/q and by the input
size |I|, to get a formula with the replication rate on the left:

\[ \frac{\sum_{i=1}^{p} q_i}{|I|} \ge \frac{q\,|O|}{g(q)\,|I|} \tag{4} \]



Equation 4 gives us a lower bound on r. Thus, in summary,
given a particular problem, we derive our lower bounds in
this paper as follows:

• Suppose the instance of the problem has |I| inputs and
|O| outputs.

• We find an upper bound, g(q), on the number of outputs
any q inputs can generate.

• If g(q)/q is monotonically increasing in q then we can
bound the replication rate using our recipe.

• Suppose the maximum number of inputs any reducer
can take is q. Then the replication rate is
r ≥ q|O| / (g(q)|I|).
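The final bound of the recipe, r ≥ q|O| / (g(q)|I|), is a one-line computation once g, |I|, and |O| are known. The sketch below uses hypothetical function names and plugs in the Hamming-distance-1 quantities stated elsewhere in the paper (g(q) = (q/2) log_2 q, |I| = 2^b, |O| = (b/2)2^b) as a check that the recipe reduces to b / log_2 q.

```python
from math import log2

def replication_lower_bound(q, g, num_inputs, num_outputs):
    """Lower bound on r from the recipe: q|O| / (g(q)|I|).

    Valid only when g(q)/q is monotonically increasing in q.
    """
    return (q * num_outputs) / (g(q) * num_inputs)

def hamming_lower_bound(b, q):
    """Instantiate the recipe for Hamming distance 1 on b-bit strings."""
    def g(size):
        return (size / 2) * log2(size)
    return replication_lower_bound(
        q, g, num_inputs=2 ** b, num_outputs=(b / 2) * 2 ** b
    )

print(hamming_lower_bound(12, 16))  # b / log2(q) = 12 / 4
print(hamming_lower_bound(12, 2))   # b / log2(2) = 12
```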

2.5 Our Results 

We summarize our results in two tables. 

Table 1 gives the lower bounds for each problem we obtain.
The table enumerates for each problem the total number of
inputs |I|, number of outputs |O|, the upper bound g(q)
on the number of outputs q inputs can generate for each
problem, and the lower bound we derived.

Table 2 gives the upper bound on the replication rate for
each problem. In several cases our upper bounds are derived
using multiple constructive algorithms, giving different upper
bounds depending on the input parameters. Therefore,
Table 2 only gives a representative upper bound for each
problem, with a forward reference to the section in which
more detailed results are presented.

3. THE HAMMING-DISTANCE-1 PROBLEM 

We begin with the tightest result we can offer. For the 
problem of finding pairs of bit strings of length b that are at 
Hamming distance 1, we have a lower bound on the repli- 
cation rate r as a function of q, the maximum number of 
inputs assigned to a reducer. This bound is essentially best 
possible, as we shall point to a number of mapping schemas 
that solve the problem and have exactly the replication rate 
stated in the lower bound. 

3.1 Bounding the Number of Outputs 

As described in Section 2.4, our first task is to develop
a tight upper bound on the number of outputs that can be
covered by a reducer of size q.

Lemma 3.1. For the Hamming-distance-1 problem, a reducer
of size q can cover no more than (q/2) log_2 q outputs.

Proof. The proof is an induction on b, the length of the
bit strings in the input. The basis is b = 1. Here, there are
only two strings, so q is either 1 or 2. If q = 1, the reducer
can cover no outputs. But (q/2) log_2 q is 0 when q = 1, so
the lemma holds in this case. If q = 2, the reducer can cover
at most one output. But (q/2) log_2 q is 1 when q = 2, so
again the lemma holds.

Now let us assume the bound for b and consider the case
where the inputs consist of strings of length b+1. Let X
be a set of q bit strings of length b+1. Let Y be the subset
of X consisting of those strings that begin with 0, and let
Z be the remaining strings of X - those that begin with 1.
Suppose Y and Z have y and z members, respectively, so
q = y + z.

An important observation is that for any string in Y, there
is at most one string in Z at Hamming distance 1. That is,
if 0w is in Y, it could be at Hamming distance 1 from 1w in Z,
if that string is indeed in Z, but there is no other string in
Z that could be at Hamming distance 1 from 0w, since all
strings in Z start with 1. Likewise, each string in Z can be
at distance 1 from at most one string in Y. Thus, the number
of outputs with one string in Y and the other in Z is at most
min(y, z).

Problem                               |I|                |O|        g(q)                         Lower bound on r
------------------------------------  -----------------  ---------  ---------------------------  -------------------------
Hamming-Distance-1, b-bit strings     2^b                b 2^b / 2  (q/2) log_2 q (Section 3.1)  b / log_2 q (Section 3.2)
Triangle-Finding, n nodes             n^2 / 2            n^3 / 6    (Section 4.1)                (Section 4.1)
Sample Graphs (size s nodes) in Alon  (n choose 2) or m  n^s        (Section 5.2)                (Sections 5.2 and 5.3)
  Class in graph of m edges, n nodes
2-Paths in n-node graph                                             (Section 5.4.1)              (Section 5.4.1)
Multiway Join: N bin. rels, m vars.,                                                             (Section 5.5.1)
  Dom. n, parameter p from [6]
n x n Matrix Multiplication           2n^2               n^2        (Section 6.1)                (Section 6.1)

Table 1: Lower bound on replication rate r for various problems in terms of number of inputs |I|, number
of outputs |O|, and maximum number of inputs per reducer q.

Table 2: Representative upper bound on the replication rate r for each problem considered in this paper.
This table only presents a representative upper bound, with a forward reference to the section that derives
all upper bounds with constructive algorithms for each problem.

So let us count the maximum number of outputs that can
have their inputs within X. By the inductive hypothesis,
there are at most (y/2) log_2 y outputs both of whose inputs
are in Y, at most (z/2) log_2 z outputs both of whose inputs
are in Z, and, by the observation above,
at most min(y, z) outputs with one input in each of Y and
Z.

Assume without loss of generality that y ≤ z. Then the
maximum number of outputs that can be covered by a reducer
with q inputs, all strings of length b+1, is

\[ \frac{y}{2}\log_2 y + \frac{z}{2}\log_2 z + y \]

We must show that this function is at most (q/2) log_2 q, or,
since q = y + z, we need to show

\[ \frac{y}{2}\log_2 y + \frac{z}{2}\log_2 z + y \le \frac{y+z}{2}\log_2 (y+z) \tag{5} \]

under the condition that z ≥ y.

First, observe that when y = z, Equation 5 holds with
equality. That is, both sides become y(log_2 y + 1). Next,
consider the derivatives, with respect to z, of the two sides
of Equation 5. d/dz of the left side is

\[ \frac{1}{2}\log_2 z + \frac{\log_2 e}{2} \]

while the derivative of the right side is

\[ \frac{1}{2}\log_2 (y+z) + \frac{\log_2 e}{2} \]

Since z ≥ y > 0, the derivative of the left side is always less
than or equal to the derivative of the right side. Thus, as z
grows larger than y, the left side remains no greater than the
right. That proves the induction step, and we may conclude
the lemma. □
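Lemma 3.1 can also be checked exhaustively for very short strings. The sketch below is a brute-force sanity check, not part of the proof: it enumerates every set of q strings of length b (as integers), counts the covered pairs, and compares the maximum with (q/2) log_2 q. The function names are hypothetical, and the enumeration is only feasible for small b.

```python
from itertools import combinations
from math import log2

def covered_outputs(strings):
    """Number of pairs in `strings` (ints) at Hamming distance exactly 1."""
    return sum(1 for u, v in combinations(strings, 2) if bin(u ^ v).count("1") == 1)

def max_covered(b, q):
    """Largest number of outputs any q distinct b-bit strings can cover."""
    return max(covered_outputs(subset) for subset in combinations(range(2 ** b), q))

def lemma_holds(b):
    """Check max_covered(b, q) <= (q/2) log2(q) for every feasible q."""
    return all(
        max_covered(b, q) <= (q / 2) * log2(q) + 1e-9  # small float tolerance
        for q in range(1, 2 ** b + 1)
    )

print([lemma_holds(b) for b in (1, 2, 3)])
print(max_covered(3, 4), max_covered(3, 8))
```

For b = 3 the bound is attained when q is a power of 2, for example by the four strings of a 2-dimensional subcube, which hints at why the subcube-based algorithms of Section 3.3 can match the lower bound.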

3.2 Lower Bound for Hamming Distance 1 

We can use Lemma 3.1 to get a lower bound on the replication
rate as a function of q, the maximum number of inputs
at a reducer.

Theorem 3.2. For the Hamming-distance-1 problem with
inputs of length b, the replication rate r is at least b/log_2 q.

Proof. Suppose there are p reducers, and the ith reducer
has q_i ≤ q inputs. We apply our four-step recipe described
in Section 2.4.

1. Deriving g(q): Recall that g(q) is the maximum number
of outputs a reducer can cover with q inputs. By
Lemma 3.1, g(q) = (q/2) log_2 q.

2. Number of Inputs and Outputs: There are 2^b bit
strings of length b. The total number of outputs is
(b/2)2^b. Therefore |I| = 2^b and |O| = (b/2)2^b.
3. The Inequality: Substituting for g(q_i) and |O| from above:

\[ \sum_{i=1}^{p} \frac{q_i}{2}\log_2 q_i \ge \frac{b}{2}\,2^b \tag{6} \]

4. Replication Rate: Finally we employ the manipulation
trick from Section 2.4, where we arrange the terms
of this inequality so that the left side is the replication
rate. Recall we must separate a factor q_i from other
factors involving q_i by replacing all other occurrences
of q_i on the left by the upper bound q. That is, we replace
log_2 q_i by log_2 q on the left of Equation 6. Since
doing so can only increase the left side, the inequality
continues to hold:

\[ \sum_{i=1}^{p} \frac{q_i}{2}\log_2 q \ge \frac{b}{2}\,2^b \tag{7} \]

The replication rate is $r = \sum_{i=1}^{p} q_i / |I| = \sum_{i=1}^{p} q_i / 2^b$.
We can move factors in Equation 7 to get a lower
bound on $r = \sum_{i=1}^{p} q_i / 2^b \ge b / \log_2 q$, which is exactly
the statement of the theorem.

□

3.3 Upper Bound for Hamming Distance 1

There are a number of algorithms for finding pairs at Hamming
distance 1 that match the lower bound of Theorem 3.2.
First, suppose q = 2; that is, every reducer gets exactly 2
inputs, and is therefore responsible for at most one output.
Theorem 3.2 says the replication rate r must be at
least b/log_2 2 = b. But in this case, every input string w of
length b must be sent to exactly b reducers - the reducers
corresponding to the pairs consisting of w and one of the b
inputs that are at Hamming distance 1 from w.

There is another simple case at the other extreme. If
q = 2^b, then we need only one reducer, which gets all the
inputs. In that case, r = 1. But Theorem 3.2 says that r
must be at least b/log_2(2^b) = 1.

In [3], there is an algorithm called Splitting that, for the
case of Hamming distance 1, uses 2^(1+b/2) reducers, for some
even b. Half of these reducers, or 2^(b/2) reducers, correspond
to the 2^(b/2) possible bit strings that may be the first half of
an input string. Call these Group I reducers. The second
half of the reducers correspond to the 2^(b/2) bit strings that
may be the second half of an input. Call these Group II
reducers. Thus, each bit string of length b/2 corresponds to
two different reducers.

An input w of length b is sent to 2 reducers: the Group-I
reducer that corresponds to its first b/2 bits, and the Group-II
reducer that corresponds to its last b/2 bits. Thus, each
input is assigned to two reducers, and the replication rate
is 2. That also matches the lower bound of b/log_2(2^(b/2)) =
b/(b/2) = 2. It is easy to observe that every pair of inputs at
distance 1 is sent to some reducer in common. These inputs
must either agree in the first half of their bits, in which case
they are sent to the same Group-I reducer, or they agree on
the last half of their bits, in which case they are sent to the
same Group-II reducer.

We can generalize the Splitting Algorithm so that for
any c ≥ 2 such that c divides b evenly, we can have reducer
size 2^(b/c) and replication rate c. Note that for reducer
size 2^(b/c), the lower bound on the replication rate is exactly
b/log_2(2^(b/c)) = c. We split each bit string w into c segments,
w_1 w_2 ... w_c, each of length b/c. We will have c groups of
reducers, numbered 1 through c. There will be 2^(b - b/c) reducers
in each group, corresponding to each of the 2^(b - b/c) bit strings
of length b - b/c. For i = 1, ..., c, we map w to the Group-i
reducer that corresponds to bit string w_1 ... w_(i-1) w_(i+1) ... w_c,
that is, w with the ith substring w_i removed. Thus, each
input is sent to c reducers, one in each of the c groups, and
the replication rate is c. Finally, we need to argue that the
mapping schema solves the problem. Any two strings u and
v at Hamming distance 1 will disagree in only one of the c
segments of length b/c, and will agree in every other segment.
If they disagree in their ith segments, then they will
be sent to the same Group-i reducer, because we map them
to the Group-i reducers ignoring the values in their ith segments.
Thus, this Group-i reducer will cover the output pair
{u, v}.

Figure 1: Known algorithms matching the lower
bound on replication rate (horizontal axis: log_2 q,
marked at b/6, b/5, b/4, b/3, b/2, and b)

Figure 1 illustrates what we know. The hyperbola is the
lower bound. Known algorithms that match the lower bound 
on replication rate are shown with dots. 
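The generalized Splitting algorithm of this section, in which a string is divided into c segments and sent to c reducers, each keyed by the string with one segment removed, can be sketched as follows. Representing a reducer by the pair (group number, remaining bits) and the function names used here are assumptions made for the illustration.

```python
from collections import defaultdict

def splitting_keys(w, c):
    """Reducer keys for bit string w: one per group, with segment i removed."""
    seg = len(w) // c  # assumes c divides len(w) evenly
    parts = [w[i * seg:(i + 1) * seg] for i in range(c)]
    return [(i + 1, "".join(parts[:i] + parts[i + 1:])) for i in range(c)]

def build_reducers(b, c):
    """Assign every b-bit string to its c reducers."""
    reducers = defaultdict(set)
    for x in range(2 ** b):
        w = format(x, f"0{b}b")
        for key in splitting_keys(w, c):
            reducers[key].add(w)
    return reducers

def covers_all_pairs(b, c):
    """Every pair at Hamming distance 1 must share at least one reducer."""
    for x in range(2 ** b):
        for bit in range(b):
            u = format(x, f"0{b}b")
            v = format(x ^ (1 << bit), f"0{b}b")
            if not set(splitting_keys(u, c)) & set(splitting_keys(v, c)):
                return False
    return True

b, c = 6, 3
reducers = build_reducers(b, c)
rate = sum(len(r) for r in reducers.values()) / 2 ** b
print(len(reducers), max(len(r) for r in reducers.values()), rate, covers_all_pairs(b, c))
```

For b = 6 and c = 3 this yields c · 2^(b - b/c) = 48 reducers of size 2^(b/c) = 4 and replication rate 3, which equals the lower bound b / log_2 q = 6 / 2.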

3.4 An Algorithm for Large q 

The lower bound in Fig. 1 is matched for many values of
q as long as log_2 q ≤ b/2. However, what happens between
b/2 and b is less clear. Surely r ≤ 2 for that entire range.
In this subsection and the next we shall show that there are
algorithms for log_2 q near b with replication rates strictly
less than 2.

There is a family of algorithms that use reducers with
large input - q well above 2^(b/2), but lower than 2^b. The
simplest version of these algorithms divides bit strings of
length b into left and right halves of length b/2 and organizes
them by weights, as suggested by Fig. 2. The weight of a bit
string is the number of 1's in that string. In detail, for some
k, which we assume divides b/2, we partition the weights
into b/(2k) groups, each with k consecutive weights. Thus,
the first group is weights 0 through k - 1, the second is
weights k through 2k - 1, and so on. The last group has an
extra weight, b/2, and consists of weights b/2 - k through b/2.

There are (b/(2k))^2 reducers; each corresponds to a range
of weights for the first half and a range of weights for the
second half. A string is assigned to the reducer for (i, j),
where i, j = 1, 2, ..., b/(2k), if the left half of the string has
weight in the range (i - 1)k through ik - 1 and the right half of
the string has weight in the range (j - 1)k through jk - 1
(with the last range in each half extended to include weight b/2).

Figure 2: Partitioning by weight. Only the border
weights need to be replicated. (Axes: left-half weight and
right-half weight.)

Consider two bit strings w_0 and w_1 of length b that differ
in exactly one bit. Suppose the bit in which they differ is
in the left half, and suppose that w_1 has a 1 in that bit.
Finally, let w_1 be assigned to reducer R. Then unless the
weight of the left half of w_1 is the lowest weight for the left
half that is assigned to reducer R, w_0 will also be at R, and
therefore R will cover the pair {w_0, w_1}. However, if the
weight of w_1 in its left half is the lowest possible left-half
weight for R, then w_0 will be assigned to the reducer with
the same range for the right half, but the next lower range
for the left half. Therefore, to make sure that w_0 and w_1
share a reducer, we need to replicate w_1 at the neighboring
reducer that handles w_0. The same problem occurs if w_0
and w_1 differ in the right half, so any string whose right half
has the lowest possible weight in its range also has to be
replicated at a neighboring reducer. We suggested in Fig. 2
how the strings with weights at the two lower borders of the
ranges for a reducer need to be replicated at a neighboring
reducer.
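
The weight-partition scheme and its border replication can be simulated directly for a small case. The sketch below assumes b = 8 and k = 2, numbers the weight groups from 0 rather than 1, and uses hypothetical helper names (`assign`, `check`); it is an illustration of the argument above, not the authors' implementation.

```python
from itertools import product

def assign(s, b, k):
    """Cells (reducers) that receive bit string s, given as a tuple of 0/1."""
    half = b // 2
    groups = half // k                  # number of weight groups per half

    def group(w):
        return min(w // k, groups - 1)  # the last group also takes weight b/2

    wl, wr = sum(s[:half]), sum(s[half:])
    i, j = group(wl), group(wr)
    cells = {(i, j)}
    if i > 0 and wl == i * k:           # left weight is the lowest in its range
        cells.add((i - 1, j))
    if j > 0 and wr == j * k:           # right weight is the lowest in its range
        cells.add((i, j - 1))
    return cells

def check(b, k):
    """Return (every distance-1 pair shares a cell, average copies per string)."""
    strings = list(product((0, 1), repeat=b))
    covered = True
    for s in strings:
        for pos in range(b):
            if s[pos] == 1:
                t = s[:pos] + (0,) + s[pos + 1:]   # neighbor with that bit cleared
                if not (assign(s, b, k) & assign(t, b, k)):
                    covered = False
    rate = sum(len(assign(s, b, k)) for s in strings) / len(strings)
    return covered, rate

covered, rate = check(b=8, k=2)
```

For such a small b the measured rate differs from the asymptotic estimate, because border weights are not yet close to a 1/k fraction of all strings; the coverage property holds exactly.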

Now, let us analyze the situation, including the maximum
number q of inputs assigned to a reducer, and the replication
rate. For the bound on q, note that the vast majority of the
bit strings of length n have weight close to n/2. The number
of bit strings of weight exactly n/2 is \binom{n}{n/2}. Stirling's
approximation [SI gives us 2^n/\sqrt{2\pi n} for this quantity. That
is, one in O(\sqrt{n}) of the strings have the average weight.

If we partition strings as suggested by Fig. 2, then the
most populous k x k cell, the one that contains strings with
weight b/4 in the first half and also weight b/4 in the second
half, will have no more than

    k^2 \left( \frac{2^{b/2}}{\sqrt{2\pi b/2}} \right)^2 = \frac{k^2 2^b}{\pi b}

strings assigned (see footnote 4). If k is a constant, then in terms of the
horizontal axis in Fig. 1, this algorithm has log_2 q equal to
b - log_2 b plus or minus a constant. It is thus very close to
the right end, but not exactly at the right end.

4 Note that many of the cells have many fewer strings as-

For the replication rate of the algorithm, if k is a constant,
then within any cell there is only a small ratio of variation,
among all pairs (i, j) assigned to that cell, of the numbers
of strings with weights i and j in the left and right halves,
respectively. Moreover, when we look at the total number
of strings in the borders of all the cells, the differences
average out, so the fraction of strings that are replicated is very
close to 2k/k^2 = 2/k. That is, a string is replicated if
either its left half has a weight divisible by k or its right half
does. Note that strings in the lower-left corner of a cell are
replicated twice, strings of the other 2k - 2 points on the
border are replicated once, and the majority of strings are
not replicated at all. We conclude that the replication rate
is 1 + 2/k.

3.5 Generalization to d Dimensions 



The algorithm of Section 3.4 can be generalized from 2
dimensions to d dimensions. Break bit strings of length b
into d pieces of length b/d, where we assume d divides b.
Each string of length b can thus be assigned to a cell in a
d-dimensional hypercube, based on the weights of each of its
d pieces. Assume that each cell has side k in each dimension,
where k is a constant that divides b/d.

The most populous cell will be the one that contains strings
where each of its d pieces has weight b/(2d). Again using
Stirling's approximation, the number of strings assigned to
this cell is

    k^d \left( \frac{2^{b/d}}{\sqrt{2\pi b/d}} \right)^d = \frac{k^d 2^b}{b^{d/2} (2\pi/d)^{d/2}}

On the assumption that k is constant, the value of log_2 q is

    b - (d/2) log_2 b

plus or minus a constant.

To compute the replication rate, observe that every point
on each of the d faces of the hypercube that are at the low
ends of their dimension must be replicated. The number of
points on one face is k^{d-1}, so the sum of the volumes of the
faces is d k^{d-1}. The entire volume of a cell is k^d, so the
fraction of points that are replicated is d/k, and the replication
rate is 1 + d/k. Technically, we must prove that the points
on the border of a cell have, on average, the same number
of strings as other points in the cell. As in Section 3.4, the
border points in any dimension are those whose corresponding
substring has a weight divisible by k. As long as k is
much smaller than b/d, this number is close to 1/k-th of all
the strings of that length.
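
The face-counting argument can be reproduced by enumerating the points of a single cell. This sketch assumes, as the text does, that every point of the cell carries about the same number of strings; the function name `average_replication` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def average_replication(d, k):
    """1 plus the average number of low faces a point of a k^d cell lies on."""
    points = list(product(range(k), repeat=d))
    # A point is replicated once for each coordinate at the low end (value 0).
    extra = sum(sum(1 for c in p if c == 0) for p in points)
    return 1 + Fraction(extra, len(points))

rate_3d = average_replication(3, 4)   # d = 3, k = 4
rate_2d = average_replication(2, 5)   # d = 2, k = 5
```

Points on several low faces are counted once per face, which is why the total is exactly d k^{d-1} copies per cell rather than the number of distinct border points.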

3.6 Larger Hamming Distances 

Unfortunately, the analysis for Hamming distance 1 does
not generalize easily to higher distances. To see why, consider
Hamming distance 2. While for Hamming distance 1
we learned that there is an O(q log q) upper bound on the
number of outputs covered by a reducer with q inputs, for
distance 2 this bound is much higher: \Omega(q^2), at least for
small q. That prevents us from getting a good lower bound
on replication rate.

signed, and in fact a large fraction of the strings have weights
within \sqrt{b} of b/4 in both their left halves and right halves.
In the best implementation, we would combine the cells with
relatively small population at a single compute node, in order
to equalize the work at each node.

The \Omega(q^2) bound comes from an algorithm from [3] called
"Ball-2" that creates one reducer for each string of length
b. For distance 2, this algorithm assigns to the reducer for
string s all those strings at distance 1 from s. Notice that
all distinct strings at distance 1 from s are distance 2 from
each other. Thus, each reducer covers \binom{q}{2} outputs. Since
q = b, each reducer covers \binom{b}{2}, or about b^2/2, outputs.
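
The claim that a Ball-2 reducer covers every pair of its inputs can be checked for one reducer. The sketch below assumes b = 7 and an arbitrary center string; the helper names are illustrative only.

```python
from itertools import combinations

def ball_reducer_inputs(s):
    """Strings at Hamming distance exactly 1 from s (s is a tuple of bits)."""
    return [s[:i] + (1 - s[i],) + s[i + 1:] for i in range(len(s))]

def hamming(u, v):
    """Number of positions in which u and v differ."""
    return sum(a != c for a, c in zip(u, v))

center = (0, 1, 1, 0, 1, 0, 0)          # arbitrary string of length b = 7
inputs = ball_reducer_inputs(center)     # the q = b inputs of this reducer

# Count the pairs of inputs that are at distance exactly 2 from each other.
pairs_at_distance_2 = sum(
    1 for u, v in combinations(inputs, 2) if hamming(u, v) == 2
)
```

Here q = 7 and all \binom{7}{2} = 21 pairs of inputs are outputs, which is the quadratic growth the text refers to.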

On the other hand, we can generalize the upper bound
of Section 3.3 to distance d. We divide the b bits of input
strings into k equal-length pieces. A reducer corresponds
to a choice of d of the k pieces to delete and a bit string of
length b(1 - d/k) corresponding to the k - d pieces of a string
that are not deleted. An input s is sent to \binom{k}{d} reducers:
those corresponding to the strings we obtain by deleting d
of the k segments of string s. Thus, the replication rate
is approximately k^d/d!, assuming k is much larger than d.
Again using Stirling's approximation for the factorial, this
replication rate is approximately r = (ek/d)^d.
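
The segment-deletion scheme can be tested exhaustively on short strings. This sketch assumes b = 8, k = 4 and d = 2, and identifies a reducer by a hypothetical (deleted positions, remaining string) pair; it checks all pairs at distance at most d.

```python
from itertools import combinations
from math import comb

def deletion_reducers(s, k, d):
    """Reducers for string s: one per choice of d of the k segments to delete."""
    seg = len(s) // k
    pieces = [s[i * seg:(i + 1) * seg] for i in range(k)]
    out = set()
    for deleted in combinations(range(k), d):
        kept = "".join(p for i, p in enumerate(pieces) if i not in deleted)
        out.add((deleted, kept))
    return out

def hamming(u, v):
    """Number of positions in which u and v differ."""
    return sum(a != c for a, c in zip(u, v))

b, k, d = 8, 4, 2
strings = [format(x, "0{}b".format(b)) for x in range(2 ** b)]
reducers = {s: deletion_reducers(s, k, d) for s in strings}

# Two strings within distance d differ in at most d segments, so deleting
# those segments (plus others if needed) leaves identical remaining strings.
all_covered = all(
    reducers[u] & reducers[v]
    for u, v in combinations(strings, 2)
    if hamming(u, v) <= d
)
copies_per_input = {len(r) for r in reducers.values()}
```

With these parameters every input is sent to \binom{4}{2} = 6 reducers.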

4. TRIANGLE FINDING 

We shall now consider the problem of finding triangles,
introduced in Example 2.2. We shall first derive a lower
bound assuming that all possible edges in the data graph
can be present. That assumption follows our model, since
we assume every possible output can be made, and every
possible input could be present. However, applications of
triangle-finding, such as in analysis of communities in social
networks, are generally applied to large but sparse graphs.
As a result, we shall continue the analysis by showing how 
to adjust the bound q on reducer size to take into account 
the fact that most inputs will not be present. When we 
make this adjustment, we see that the lower bound we get 
matches, to within a constant factor, the upper bound ob- 
tained from known algorithms. 

4.1 Lower and Upper Bound for Finding Triangles

Recall that, as described in Example 2.2, the inputs are
the possible edges of a graph, and the outputs are the triples
of edges that form a triangle. Suppose n is the number of
nodes of the input graph. Following the recipe from Section 2.4:

1. Deriving g(q): We claim a reducer with q inputs can
cover at most (\sqrt{2}/3) q^{3/2} outputs (triangles), which happens
when the reducer is sent all the edges among a set
of k = \sqrt{2q} nodes. This point was proved, to within
an order of magnitude, in [21], who in turn credit the
thesis of Schank [20] (see footnote 5). Suppose we assign to a reducer
all the edges among a set of k nodes. Then there are
\binom{k}{2} edges assigned to this reducer, or approximately
k^2/2 edges. Since this quantity is q, we have k = \sqrt{2q}.
The number of triangles among k nodes is \binom{k}{3}, or
approximately k^3/6 outputs. In terms of q, the upper
bound on the number of outputs is

    \frac{\sqrt{2}}{3} q^{3/2}

5 What is actually proved is that among q edges, you can
form at most O(q^{3/2}) triangles. However, picking a set
of nodes and all edges among them will match this upper
bound.

2. Number of Inputs and Outputs: The number of
inputs is \binom{n}{2}, or approximately n^2/2. The number of
outputs is \binom{n}{3}, or approximately n^3/6.

3. \sum_{i=1}^{p} g(q_i) ≥ |O| Inequality: So using the formulas
from (1) and (2), if there are p reducers each with ≤ q
inputs:

    \sum_{i=1}^{p} \frac{\sqrt{2}}{3} q_i^{3/2} ≥ \frac{n^3}{6}    (8)

We can replace a factor of \sqrt{q_i} on the left of Equation 8
by \sqrt{q}, since q ≥ q_i, and then move that factor to the
denominator of the right side. Thus,

    \sum_{i=1}^{p} q_i ≥ \frac{n^3}{2\sqrt{2}\sqrt{q}}    (9)

4. Replication Rate: The replication rate is \sum_{i=1}^{p} q_i
divided by the number of inputs, which is n^2/2 from
(2). We can manipulate Equation 9 as per the trick in
Section 2.4 to get

    r = \frac{\sum_{i=1}^{p} q_i}{n^2/2} ≥ \frac{n}{\sqrt{2q}}
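
The two formulas of this derivation, g(q) and the resulting bound on r, are easy to evaluate numerically. The sketch below uses hypothetical function names and compares the exact triangle count of a k-node clique with (\sqrt{2}/3) q^{3/2}, where q is the exact number of edges in that clique.

```python
from math import comb, sqrt

def g_bound(q):
    """Upper bound (sqrt(2)/3) * q^(3/2) on triangles covered with q edges."""
    return sqrt(2) / 3 * q ** 1.5

def replication_lower_bound(n, q):
    """Lower bound n / sqrt(2q) on the replication rate."""
    return n / sqrt(2 * q)

# Ratio of the exact triangle count of a k-clique to the bound at q = C(k, 2).
ratios = {}
for k in (10, 100, 1000):
    q = comb(k, 2)                      # edges among k nodes
    ratios[k] = comb(k, 3) / g_bound(q)

example_rate = replication_lower_bound(n=1000, q=5000)
```

The ratio stays below 1 and approaches 1 as k grows, which reflects the approximations k^2/2 and k^3/6 used in the text.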

Upper Bound: There are known algorithms that, to
within a constant factor, match the lower bound on replication
rate. See [21] and [2]. These algorithms are stated in
terms of the number of edges, m, rather than the number of
nodes, n. However, for the case m = \binom{n}{2}, which is what we
assume when we consider all possible edges and triangles,
these algorithms do in fact imply a replication rate that is
O(n/\sqrt{q}). We shall next consider how to modify the analysis
on the assumption that the true input will consist of m
randomly chosen edges.

4.2 Analysis for Sparse Data Graphs 

The lower bound r = \Omega(n/\sqrt{q}) holds on the assumption
that all edges are actually present in the input. But as we
pointed out, commonly the data graph to which a triangle-finding
algorithm is applied is sparse. We shall show that,
with essentially the same limitation q on the number of edges
that any reducer must deal with, the lower bound on replication
rate can be transformed to r = \Omega(\sqrt{m/q}).

Suppose the data graph has m of the possible \binom{n}{2} edges,
and that these edges are chosen randomly. Then if we want
no more than an expected value of q for the number of edges
input to any one reducer, we can actually assign a "target"
q_t = q n(n - 1)/(2m) of the possible edges to one reducer and
know that the expected number of edges that will actually
arrive will be q.

We already know from Section 4.1 that if we assign at
most q_t of the \binom{n}{2} possible edges to any reducer, then the
replication rate r is \Omega(n/\sqrt{q_t}). But on the assumption that
only m edges are truly present in the input, q_t is O(q n^2/m),
from which we can conclude

    r = \Omega(n/\sqrt{q n^2/m}) = \Omega(\sqrt{m/q})

This lower bound is met (to within a constant factor) by
the algorithms of [21] and [2] when we measure reducer size
in terms of the number of edges m (as these papers do),
rather than in terms of the number of possible edges \binom{n}{2}.
There is a natural concern that a random selection of the
edges will cause more than q actual edges to be assigned to
some of the reducers. However, we are only claiming bounds
to within a constant factor, and by lowering the target q_t by,
say, a factor of 2, we can make the probability that one or
more reducers will get more than q actual edges as low as
we like for large n and m.
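
The rescaling from q to the target q_t can be illustrated with concrete numbers. This sketch plugs q_t into the dense-graph bound n/\sqrt{2 q_t} from Section 4.1; the function names and the sample values of n, m and q are assumptions made for illustration.

```python
from math import sqrt

def target_size(q, n, m):
    """Target q_t = q * n(n-1) / (2m) of possible edges per reducer."""
    return q * n * (n - 1) / (2 * m)

def sparse_lower_bound(q, n, m):
    """Dense-graph bound n / sqrt(2 * q_t), evaluated at the target q_t."""
    return n / sqrt(2 * target_size(q, n, m))

n, m, q = 10 ** 4, 10 ** 6, 10 ** 4
q_t = target_size(q, n, m)
bound = sparse_lower_bound(q, n, m)
reference = sqrt(m / q)     # the form sqrt(m/q) stated in the text
```

For these values the bound evaluates to roughly 10, essentially equal to \sqrt{m/q}.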

5. FINDING INSTANCES OF OTHER GRAPHS 

The analysis of Section 4 extends to any sample graph
whose instances we want to find in a larger data graph. For 
each problem of this type, the sample graph is fixed, while 
the data graph is the input. Previously, we looked only at 
the triangle as a sample graph, but we could similarly search 
for cycles of some length greater than 3, or for complete 
graphs of a certain size, or any other small graph whose 
instances in the data graph we wanted to find. 

5.1 The Alon Class of Sample Graphs 

In [4], Noga Alon analyzed the maximum number of oc-
currences of a sample graph that could occur in a data graph 
of n nodes and m edges. In particular, he defined a class of 
graphs, which we shall call the Alon class of sample graphs. 
These graphs have the property that we can partition the
nodes into disjoint sets, such that the subgraph induced by
each set of the partition either:

1. Is a single edge between two nodes, or

2. Contains an odd-length Hamiltonian cycle.

The sample graph may have any other edges as well. The 
Alon class is very rich. Every cycle, every graph with a 
perfect matching, and every complete graph is in the Alon 
class. Paths of odd length are also in the Alon class, since 
we may use alternating edges along the path as a decompo- 
sition. However, paths of even length are not in the Alon 
class, since there are no cycles of any length, and the odd 
number of nodes cannot be partitioned into disjoint edges. 

5.2 Lower Bound for the Alon Class 

The key result from [4] that we need is that for any sample
graph S in the Alon class, if S has s nodes, then the number
of instances of S in a graph of m edges is O(m^{s/2}). So if the
ith reducer has q_i inputs, the number of instances of S that it
can find is O(q_i^{s/2}). But if all edges are present, the number
of instances of S is \Omega(n^s). Note the number of instances
need not be exactly n^s, since there may be symmetries in
S, as we saw for the case of the triangle. However, there
are surely at least n^s/s! distinct sets of nodes that form the
sample graph S.

Now, we can repeat the analysis we did for the triangle.
If there are p reducers, and the ith reducer has q_i inputs,
then

    \sum_{i=1}^{p} q_i^{s/2} = \Omega(n^s)

If q is an upper bound on q_i, we can write the above as

    q^{(s/2)-1} \sum_{i=1}^{p} q_i = \Omega(n^s)

The number of inputs is \binom{n}{2}. Thus, the replication rate r is

    r = \sum_{i=1}^{p} q_i / \binom{n}{2} = \Omega(n^{s-2}/q^{(s-2)/2}) = \Omega((n/\sqrt{q})^{s-2})

5.3 Bounds in Terms of Edges

As we did for triangles, we can scale q up by a factor of
n^2/m if we assume that the actual data is m out of the \binom{n}{2}
possible edges. If we do so, the lower bound on r becomes

    r = \Omega((n/\sqrt{q n^2/m})^{s-2}) = \Omega((\sqrt{m/q})^{s-2})



The algorithm given in matches this lower bound, to 
within a constant factor. 

5.4 Paths of Length Two 

The analysis for sample graphs not in the Alon class is 
harder, and we shall not try to give a general rule. However, 
to see the problems that arise, we will look at the simplest 
non-Alon graph: the path of length 2 (2-paths). The problem
of finding 2-paths is similar, although not identical, to
the problem of computing a natural self join

    E(A, B) \bowtie E(B, C)

The difference is that the edge relation E contains sets of 
two nodes, rather than ordered pairs. That is, if a tuple 
(u, v) is in E, when finding 2-paths we can treat it as (v, u), 
even if the latter tuple is not found in E. 

5.4.1 Lower Bound

We again follow the recipe from Section 2.4:

1. Deriving g(q): Any two distinct edges can be combined
to form at most one 2-path. Thus, the number
of outputs (2-paths) covered by a reducer with q inputs
is at most \binom{q}{2}, or approximately q^2/2.

2. Number of Inputs and Outputs: |I| is \binom{n}{2}, or
approximately n^2/2. For counting |O|, observe that there
are \binom{n}{3} sets of three nodes, and any three nodes can
form a 2-path in three ways. That is, any of the three
nodes can be chosen to be the middle node. Thus, |O|
is 3\binom{n}{3}, or approximately 3n^3/6 = n^3/2.

3. \sum_{i=1}^{p} g(q_i) ≥ |O| Inequality: Using the formulas
from (1) and (2), if there are p reducers each with
≤ q inputs:

    \sum_{i=1}^{p} q_i^2/2 ≥ n^3/2    (10)

Replacing a factor of q_i by q on the left:

    \sum_{i=1}^{p} q_i (q/2) ≥ n^3/2    (11)

4. Replication Rate: We rearrange terms in Equation 11
to make the left side equal to \sum_{i=1}^{p} q_i divided by |I| =
n^2/2:

    r = \frac{\sum_{i=1}^{p} q_i}{n^2/2} ≥ 2n/q

This lower bound on replication rate is unlike those we
have seen before. For small q it makes sense, but for q >
2n it is less than 1, which is useless. Rather, it should be
replaced by the trivial lower bound r ≥ 1 for such large q. Once
we make this replacement, the bound is tight for an infinite
number of pairs of q and n. If q = n^2/2, then we can send
all edges to one reducer and do the work there, so r = 1 is
correct.

5.4.2 Upper Bound

If q = n, then we can have one reducer for each node. We 
send the edge (a, b) to the reducers for its two nodes a and b. 
The replication rate is thus 2, which agrees with the lower 
bound. The reducer for node u receives all edges consisting 
of u and another node, so it can put them together in all 
possible ways and produce all 2-paths that have u as the 
middle node. 

If q < n, we have to divide the task of producing the
2-paths with middle node u among several different reducers.
That means every pair of edges with u as one end has to
be assigned to some reducer in common. Suppose for convenience
that k^2 divides n. Suppose h is a hash function
that divides the n nodes into k equal-sized buckets. The reducers
will correspond to pairs [u, {i, j}], where u is a node
(intended to be the node in the middle of the 2-path), and
i and j are bucket numbers in the range 1, 2, ..., k. There
are thus n\binom{k}{2}, or approximately nk^2/2, reducers.

Let (a, b) be an edge. We send this edge to the 2(k - 1)
reducers [b, {h(a), *}] and [a, {*, h(b)}], where * denotes
any bucket number from 1 to k other than the other bucket
number in the set. We claim that any 2-path is covered
by at least one reducer. In particular, look at the reducer
[u, {i, j}]. This reducer covers all 2-paths v - u - w such
that h(v) and h(w) are each either i or j. Note that if
h(v) = h(w), then many reducers will cover this 2-path,
and we want only one to produce it. So we let the reducer
[u, {i, j}] produce the 2-path v - u - w if either

1. One of h(v) and h(w) is i and the other is j, or

2. h(v) = h(w) = i and j = i + 1 modulo k (i.e., j =
i + 1 ≤ k, or i = k and j = 1).

Each reducer receives q = 2n/k edges, and as mentioned,
the replication rate r is 2(k - 1), or approximately 2k. Since
2n/q = k, the lower bound is approximately half what this
algorithm achieves. Thus, to within a constant factor, the
upper and lower bounds match for small q as well as for
large q (where both bounds are between 1 and 2).
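
The bucket-pair scheme for 2-paths can be simulated on a small complete graph. The sketch below assumes n = 12 nodes, k = 3 buckets numbered from 0, and the hash h(v) = v mod k; these choices and the helper names are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations

def two_path_reducers(edge, h, k):
    """Reducers (u, {i, j}) that receive the undirected edge (a, b)."""
    a, b = edge
    out = set()
    for mid, other in ((a, b), (b, a)):
        for star in range(k):
            if star != h(other):
                out.add((mid, frozenset((h(other), star))))
    return out

def emits(buckets, hv, hw, k):
    """True if the reducer with this bucket pair should output v - u - w."""
    if hv != hw:
        return buckets == frozenset((hv, hw))
    return buckets == frozenset((hv, (hv + 1) % k))

n, k = 12, 3

def h(v):
    return v % k          # splits the n nodes into k equal-sized buckets

edges = list(combinations(range(n), 2))
inbox = defaultdict(list)
for e in edges:
    for r in two_path_reducers(e, h, k):
        inbox[r].append(e)

outputs = defaultdict(int)   # (middle, end, end) -> times produced
for (u, buckets), received in inbox.items():
    ends = sorted(y if x == u else x for x, y in received)
    for v, w in combinations(ends, 2):
        if emits(buckets, h(v), h(w), k):
            outputs[(u, v, w)] += 1

copies_per_edge = {len(two_path_reducers(e, h, k)) for e in edges}
max_reducer_load = max(len(v) for v in inbox.values())
```

In this run every one of the 3\binom{12}{3} = 660 possible 2-paths is produced exactly once, each edge is sent to 2(k - 1) = 4 reducers, and no reducer receives more than 2n/k = 8 edges.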

5.5 Multiway Join 

We begin by looking at the join of several binary relations. 
We can think of this extension as looking for sample graphs 
in a data graph with labeled edges; the relation names are 
the edge labels. Suppose n is the number of nodes of the 
data graph. The inputs are the possible edges of a graph, 
and the outputs are the sets of s edges that make the body 
of the multiway join true (i.e., the s labeled edges of the 
sample graph). We assume also that the multiway join, seen
as a Datalog rule (or as a hypergraph), uses m variables
(equivalently, m attributes or nodes in the hypergraph).

5.5.1 A Lower Bound for Multiway Join

Following the recipe from Section 2.4:

1. Deriving g(q): According to [6], when we have q inputs
in a multiway join, then we can have at most
g(q) = q^\rho outputs, where \rho is a parameter that depends
on properties of the hypergraph associated with
the specific multiway join. E.g., if the hypergraph has
\rho_1 edges that cover all the nodes, and this is the minimum
number of edges with this property, then \rho = \rho_1.
Otherwise, \rho comes from the solution of a linear program
that is associated with the hypergraph (see [6] for
details of how to compute \rho). From here on, we drop
constant factors, but do not use the implied big-oh
notation, for simplicity.

2. Number of Inputs and Outputs: |I| = s\binom{n}{2}, or on
the order of n^2. |O| = \binom{n}{m}, or on the order of n^m. Note
that m here is a constant, so in big-oh calculations we
can drop the factor 1/m! when approximating binomial
coefficients.

3. \sum_{i=1}^{p} g(q_i) ≥ |O| Inequality: Replacing for g(q) and
|O| from above:

    \sum_{i=1}^{p} q_i^\rho ≥ n^m    (12)

We can replace a factor of q_i^{\rho-1} on the left of Equation 12
by q^{\rho-1}, since q ≥ q_i, and then move that
factor to the denominator of the right side. Thus,

    \sum_{i=1}^{p} q_i ≥ n^m/q^{\rho-1}    (13)

4. Replication Rate: The replication rate is \sum_{i=1}^{p} q_i
divided by the number of inputs, which is n^2 from
(2). We can manipulate Equation 13 as per the trick
in Section 2.4 to get

    r = \frac{\sum_{i=1}^{p} q_i}{|I|} ≥ \frac{n^{m-2}}{q^{\rho-1}}

This lower bound can be easily generalized from binary
relations to the case where all relations have the same arity
\alpha > 2. In order to have a more quantitative picture, let us
assume also that the exponent \rho in g(q) is s/\alpha, where s is
the number of relational atoms in the join. Then the
replication rate lower bound is:

    r ≥ \frac{n^{m-\alpha}}{q^{s/\alpha - 1}}

To simplify further, we can take the special
case where s = m, i.e., when the number of relational atoms
in the join and the number of shared variables coincide (e.g.,
the join corresponds to a hypertree with an additional edge).
In this case the lower bound becomes r ≥ n^{m-\alpha} q^{1-m/\alpha}.

Below we discuss in detail some algorithms from the literature
for chain joins that match this lower bound.

5.5.2 Upper Bound for Cases of Multiway Join 

Chains of odd number of relations. Suppose we have
N relations in the chain and N is an odd positive integer.
Then, let us compute more carefully the above lower bound.
We have now m = N + 1, and the exponent in g(q) is
\rho = (N + 1)/2. Hence the lower bound is:

    r ≥ \frac{n^{N-1}}{q^{(N+1)/2 - 1}}



For the upper bound we use the results in [1]. This paper
computes the communication cost for when we have p
reducers (denoted k in [1]), each relation has size R, and
there are N relations in the join; hence the total input size
is |I| = RN. In [1] the expression that gives the communication
cost is given as the sum of N terms, each a product of
shares (denoted a_i's in [1]). The a_i's are computed there,
and if we do the arithmetic, we get communication cost per
input (hence replication rate) to be equal to (up to a factor
of p^{4/(N^2-1)}): r = p^{(N-3)/(N-1)}. After similar arithmetic
manipulations as in previous sections, we get this upper bound on the
replication rate to be:

    r = (n/\sqrt{q})^{N-1}

This upper bound matches the lower bound we computed.
The case for an even number of relations in a chain query is
similar, with the same conclusion.

Star joins. A star join joins a central fact table with several
dimension tables. It is expected that the fact table is
very large while the dimension tables are smaller but still
large. Suppose the size of the fact table is f, and all dimension
tables have the same size d_0. Then according to [1],
in order to minimize the communication cost, the share for
the attributes not in the fact table is 1, while the share for
each attribute in the fact table is d_0 p^{1/N}/d_0 = p^{1/N}, where
N is the number of dimension tables and p is the number
of reducers. We assume, moreover, that dimension tables
pairwise do not share attributes. Thus, using the communication
cost as computed in [1] and dividing it by f + N d_0,
we get the replication rate:

    r = \frac{f + N d_0 p^{(N-1)/N}}{f + N d_0}

To compute p in terms of the average reducer size, we take
the equation r(f + N d_0) = pq, which after replacing r from
above and dividing by p becomes f/p + N d_0 p^{-1/N} = q. In
order to simplify the calculations, we assume that f/p =
(1 - e)q, with e being between 0 and 1 but not very small or
very large. This is a reasonable assumption, since the size of
the fact table is much larger than the sizes of the dimension
tables; it tells us that a good fraction of the input to each
reducer comes from the fact table. Then we can solve the
above to get p = (N d_0/(eq))^N, and substituting it in the
replication rate:

    r = \frac{f + N d_0 (N d_0/(eq))^{N-1}}{f + N d_0}

Substituting in the numerator f = pq(1 - e) we get:

    r = \frac{e(1 - e) N d_0 (N d_0/(eq))^{N-1}}{f + N d_0}

A Lower Bound for Star Join. Since the practical
applications of star join assume that the fact table is orders
of magnitude larger than the dimension tables, we make a
similar assumption here. Let N be the number of dimension
tables. Suppose we have in our database n_1 constants (values)
that are values of the attributes outside the fact table,
and the arity of each dimension table is m = m_1 + m_2, where
m_2 is the number of attributes that are shared with the fact
table. Then we have at most n_1^{m_1} tuples in the output of
the join. Notice that this number is in general much less
than the size f of the fact table. We apply our technique for
finding lower bounds as follows:

1. Deriving g(q): The parameter \rho is equal to N, and
thus g(q) = q^N.

2. Number of Inputs and Outputs: |I| is f + n_1^{m_1}.
|O| is n_1^{m_1}. Notice that the number of outputs only
depends on the dimension tables' parameters.

3. \sum_{i=1}^{p} g(q_i) ≥ |O| Inequality: Substituting for g(q_i)
and |O|:

    \sum_{i=1}^{p} q_i^N ≥ n_1^{m_1}    (14)

We replace a factor of q_i^{N-1} on the left of Equation 14
by q^{N-1}, and then move that factor to the denominator
of the right side:

    \sum_{i=1}^{p} q_i ≥ n_1^{m_1}/q^{N-1}    (15)

4. Replication Rate: We get from Equation 15

    r ≥ \frac{n_1^{m_1}}{q^{N-1} (f + n_1^{m_1})}

We can write the above inequality as:

    r = \frac{N d_0 (N d_0/q)^{N-1}}{f + N d_0}

This differs from the upper bound we computed by a
factor of e(1 - e)/e^{N-1}, which under the assumptions of
the star join can be thought of as a constant.

Size of output of multiway join in the general case.
For the general case, we can apply the same technique to get
lower bounds on the replication rate, only we need to know
how to compute a bound on the size of the output of any
multiway join. We explain here how to compute a tight
bound as offered in [6], [10].

Let q be a multiway join and let G(q) be the corresponding
hypergraph. Thus the nodes of the hypergraph are the
attributes in the query and the edges of the hypergraph
correspond to the relational atoms in the query. For each edge
e of G(q) we have a variable x_e. Let S be the number of
subgoals in the query and a_e be the number of attributes for
the relational atom corresponding to the edge e. We form
the following linear program:

    minimize \sum_{e \in G(q)} x_e
    subject to \sum_{e : v \in e} x_e ≥ 1 for each node v of G(q),
               x_e ≥ 0 for each edge e of G(q)

The solution to this program is called an optimal fractional
edge cover of the query hypergraph. It can be shown [10]
that there is always a solution whose values are rational and
of bit-length polynomial in the size of the query. Fractional
edge covers can be used to give an upper bound on the size
|O| of the output of the query. Let |R_e| be the size of the
relation that corresponds to the edge e of the hypergraph
G(q). Then

    |O| ≤ \prod_{e \in G(q)} |R_e|^{x_e}
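
As a worked illustration of the fractional edge cover bound, consider the triangle query, written here with hypothetical relation names R, S and T over attributes A, B and C. The cover constraints below are the standard ones for a fractional edge cover (one constraint per attribute); this is a sketch of the calculation, not a statement of the authors' exact formulation.

```latex
% Triangle query: R(A,B) \bowtie S(B,C) \bowtie T(A,C)
\begin{align*}
\text{minimize}\quad   & x_R + x_S + x_T \\
\text{subject to}\quad & x_R + x_T \ge 1 \quad (\text{attribute } A), \\
                       & x_R + x_S \ge 1 \quad (\text{attribute } B), \\
                       & x_S + x_T \ge 1 \quad (\text{attribute } C), \\
                       & x_R,\, x_S,\, x_T \ge 0 .
\end{align*}
% Adding the three cover constraints gives 2(x_R + x_S + x_T) \ge 3,
% so the optimum is 3/2, attained at x_R = x_S = x_T = 1/2. Hence
\[
  |O| \;\le\; |R|^{1/2}\,|S|^{1/2}\,|T|^{1/2},
  \qquad\text{and with } |R| = |S| = |T| = q:\quad |O| \le q^{3/2}.
\]
```

The exponent 3/2 agrees with the O(q^{3/2}) bound on triangles used in Section 4.1.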

6. MATRIX MULTIPLICATION 

We shall now take up the common application of matrix
multiplication. That is, we suppose we have n x n matrices
R = [r_ij] and S = [s_jk] and we wish to form their product
T = [t_ik], where t_ik = \sum_{j=1}^{n} r_ij s_jk. This problem introduces
a number of ideas not present in the previous examples.
First, each output depends on many inputs, rather than just
two or three. In particular, the output t_ik depends on an
entire row of R and an entire column of S, that is, 2n
inputs, as suggested by Fig. 3.

There is also an interesting structure to the way outputs
are related to inputs, and we can exploit that structure.
Finally, the fact that summation is associative and commutative
lets us explore methods that use two interrelated rounds
of map-reduce. Surprisingly, we discover that two-round
methods are never worse than one-round methods, and can
be considerably better.

Figure 3: Input/output relationship for the matrix-
multiplication problem

6.1 The Lower Bound on Replication Rate 

1. Deriving g(q): Suppose a reducer covers the outputs
t_14 and t_23. Then all of rows 1 and 2 of R are input
to that reducer, and all of columns 4 and 3 of S are
also inputs. Thus, this reducer also covers outputs
t_13 and t_24. As a result, the set of outputs covered
by any reducer forms a "rectangle," in the sense that
there is some set of rows i_1, i_2, ..., i_w of R and some
set of columns k_1, k_2, ..., k_h of S that are input to
the reducer, and the reducer covers all outputs t_{i_u k_v},
where 1 ≤ u ≤ w and 1 ≤ v ≤ h.

We can assume this reducer has no other inputs, since
if an input to a reducer is not part of a whole row of R
or column of S, it cannot be used in any output made
by the reducer. Thus, the number of inputs to this
reducer is n(w + h), which must be less than or equal
to q, the upper bound on the number of inputs to a
reducer. As the total number of outputs covered is wh,
it is easy to show that for a given q, the number of
outputs is maximized when the rectangle is a square;
that is, w = h = q/(2n). In this case, the number of
outputs covered by the reducer is g(q) = q^2/(4n^2).

2. Number of Inputs and Outputs: There are two
matrices, each of size n^2. Therefore |I| = 2n^2 and
|O| = n^2.

3. \sum_{i=1}^{p} g(q_i) ≥ |O| Inequality: Substituting for g(q_i)
and |O|:

    \sum_{i=1}^{p} \frac{q_i^2}{4n^2} ≥ n^2

4. Replication Rate: We first leave one factor of q_i
on the left as is, and replace the other factor q_i by q.
Then, we manipulate the inequality so the expression
on the left is the replication rate and obtain:

    r = \frac{\sum_{i=1}^{p} q_i}{2n^2} ≥ \frac{2n^2}{q}

6.2 Matching Upper Bound on Replication Rate 

The lower bound r ≥ 2n^2/q can be matched by an upper
bound for a wide range of q's. If q ≥ 2n^2, then the entire job
can be done by one reducer, and if q < 2n, then no reducer
can get enough input to compute even one output. Between
these ranges, we can match the lower bound by giving each
reducer a set of rows of R and an equal number of columns
of S.

The technique of computing the result of a matrix multiplication
by tiling the output by squares is very old indeed
[16]. In the map-reduce model, tiling by squares is the right
choice if a single round of map-reduce is used, but, as we shall
see in Section 6.3, it is not quite right for two-phase matrix
multiplication, where the minimum cost occurs when the matrices are
tiled with rectangles of aspect ratio 2:1.

Let s be an integer that divides n, and let q = 2sn. Partition
the rows of R into n/s groups of s rows, and do the
same for the columns of S. There is one reducer for each
pair (G, H) consisting of a group G of R's rows and a group
H of S's columns. This reducer has q = 2sn inputs, and can
produce all the outputs t_ik such that i is one of the rows in
group G and k is one of the columns in the group H. Since
every pair (i, k) has i in some group for R and has k in some
group for S, every element of the product matrix T will be
produced by exactly one reducer.

The replication rate for each input element is the number
of groups with which its group is paired. That number is
r = n/s, since both R and S are partitioned into this number
of groups. Since q = 2sn, and thus s = q/(2n), we have that
r = 2n^2/q, exactly matching the lower bound on r.
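
The one-phase tiling can be tallied directly: count how many inputs each reducer receives and how many copies of each matrix element are sent. The sketch below assumes n = 12 and s = 3 and uses 0-based indices; the variable names are illustrative.

```python
from collections import Counter

n, s = 12, 3
groups = n // s           # number of row groups of R (and column groups of S)

load = Counter()          # inputs received by each reducer (G, H)
copies = Counter()        # copies sent of each matrix element
for i in range(n):
    for j in range(n):
        # Element of R in row i goes to every reducer whose row group is i // s.
        for H in range(groups):
            load[(i // s, H)] += 1
            copies[("R", i, j)] += 1
        # Element of S in column j goes to every reducer whose column group is j // s.
        for G in range(groups):
            load[(G, j // s)] += 1
            copies[("S", i, j)] += 1

q = max(load.values())                        # reducer size
r = sum(copies.values()) / len(copies)        # replication rate
```

With these values q = 2sn = 72 and r = n/s = 4, which equals 2n^2/q.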

6.3 Matrix Multiplication Using Two Phases 

There is another strategy for perfoming matrix multipli- 
cation using two map-reduce jobs. As we shall see, this 
method always beats the one-phase method. An interesting 
aspect of our analysis is that, while tiling by squares works 
best for the one-phase algorithm, 

• For the two-phase algorithm, the least cost occurs when 
the matrices are tiled with rectangles that have aspect 
ratio 2:1. 

We assume that we are multiplying the same n x n ma- 
trices R and S as previously in this section. 

1. In the first phase, we compute $x_{ijk} = r_{ij} s_{jk}$ for each
$i$, $j$, and $k$ between 1 and $n$. We sum the $x_{ijk}$'s at a
given reducer if they share common values of $i$ and $k$,
thus producing a partial sum for the pair $(i, k)$.

2. In the second phase, the partial sum for each pair $(i, k)$
is sent from each reducer that has computed at least
one $x_{ijk}$ for some $j$ to a reducer of the second phase
whose responsibility is to sum all these partial sums
and thus compute $t_{ik}$.
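
The two steps can be sketched on small matrices as follows. This is an illustrative sketch only: the function name `two_phase_multiply` is hypothetical, and grouping the first-phase work by chunks of `t` consecutive values of $j$ is an assumption chosen to keep the example short, not the only possible assignment.

```python
from collections import defaultdict

def two_phase_multiply(R, S, t):
    """Multiply square matrices R and S using two simulated phases."""
    n = len(R)

    # Phase 1: form x_ijk = r_ij * s_jk and add it to the partial sum held
    # by the first-phase reducer responsible for this chunk of j values.
    partial = defaultdict(int)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                partial[(j // t, i, k)] += R[i][j] * S[j][k]

    # Phase 2: one reducer per (i, k) adds up the partial sums it receives.
    T = [[0] * n for _ in range(n)]
    for (_chunk, i, k), value in partial.items():
        T[i][k] += value

    # len(partial) is the number of partial sums shipped to phase 2.
    return T, len(partial)

R = [[1, 2], [3, 4]]
S = [[5, 6], [7, 8]]
T, shipped = two_phase_multiply(R, S, t=1)
print(T, shipped)
```

With `t=1` every product $x_{ijk}$ is shipped on its own, so the second phase receives $n^3$ values; larger `t` reduces that count.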

Figure 4 suggests what the mappers and reducers of the two
phases do.

The second phase is embarrassingly parallel, since each 
partial sum contributes to only one output. However, the 
first phase requires careful organization. To begin, it is not 
sufficient to compute only the replication rate of the first 
phase, since there is significant communication in the second 
phase. The number of partial sums could be as large as $n^3$
and thus dominate the communication cost. We shall thus 
calculate the total communication involved in both phases. 

To begin this calculation, note that the mappers of the
second phase can reside at the same compute node as the $x_{ijk}$'s
to which they apply. Thus, no communication is needed
between the first-phase reducers and the second-phase
mappers. The communication between the second-phase
mappers and reducers is equal to the sum over all first-phase
reducers of the number of different $(i, k)$ pairs for which they
compute at least one $x_{ijk}$.

Figure 4: The two-phase method of matrix multiplication
(Phase 1: inputs, mappers, reducers; Phase 2: mappers, reducers)

The communication between first-phase mappers and
reducers depends on the limit $q$ we choose for the number of
inputs to a reducer in the first phase (see footnote 6) and on the
strategy we use for assigning inputs to these reducers. As for the
one-phase algorithm, we can simplify the options regarding
what inputs go to what reducers by observing that the set
of outputs covered by a reducer again forms a "rectangle."
That is, if a reducer covers both $x_{ijk}$ and $x_{yjz}$, then it also
covers $x_{ijz}$ and $x_{yjk}$. The proof is that to cover $x_{ijk}$ the
reducer must have inputs $r_{ij}$ and $s_{jk}$, while to cover $x_{yjz}$
the same reducer gets inputs $r_{yj}$ and $s_{jz}$. From these four
inputs, the reducer can also cover $x_{ijz}$ and $x_{yjk}$.

We now know that the set of outputs covered by a reducer
can be described for each $j$ by a set $G_j$ of row numbers of $R$
and a set $H_j$ of column numbers of $S$, such that the outputs
covered are all $x_{ijk}$ for which $i$ is in $G_j$ and $k$ is in $H_j$.
As before, the greatest number of covered outputs occurs
when the rectangle is a square. That is, each reducer has
an equal number of rows and columns for each $j$. We do
not know that these rows and columns must be the same for
each $j$, but it is easy to argue that if not, we could reduce
the communication in the first phase, the second phase, or both,
by using the same sets of rows and columns for each $j$.

Thus, we shall assume that each reducer in the first phase
is given a set of $s$ rows of $R$, $s$ columns of $S$, and $t$ values
of $j$ for some $s$ and $t$. Figure 5 suggests how one reducer
covers a cube in the three-dimensional space defined by the
indexes $i$, $j$, and $k$. There is a reducer covering each $x_{ijk}$,
which means that the number of reducers is $(n/s)^2(n/t)$.
Then each element of matrices $R$ and $S$ must be sent to $n/s$
reducers, so the total communication in the first phase is
$2n^3/s$. To see why, consider an element $r_{ij}$ of matrix $R$. We
know $i$ and $j$, so only $k$ is unknown. The number of reducers
that need inputs with the particular $i$ and $j$ and some $k$ is
$n/s$. The analogous argument applies to elements of matrix
$S$.

Figure 5: The responsibility of one reducer in the
first phase

Footnote 6: The reducers in the second phase require only $n$ inputs
in the worst case, so we can ignore the input size for the
second-phase reducers.

Each reducer produces a partial sum for $s^2$ pairs $(i, k)$.
Thus, the communication in the second phase is $s^2$ times
the number of reducers, or $s^2 (n/s)^2 (n/t) = n^3/t$. The sum
of the communication in the first and second phases is

$$\frac{2n^3}{s} + \frac{n^3}{t}$$
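
The two counts can be reproduced by enumerating the first-phase reducers directly. The sketch below is illustrative: the function name `two_phase_communication` is hypothetical, and it assumes that both $s$ and $t$ divide $n$ so that the blocks tile the index space exactly.

```python
def two_phase_communication(n, s, t):
    """Count communication in both phases by enumerating first-phase reducers."""
    assert n % s == 0 and n % t == 0, "s and t must divide n"
    phase1 = 0  # elements sent from first-phase mappers to first-phase reducers
    phase2 = 0  # partial sums sent from first-phase reducers to phase 2
    for _row_group in range(n // s):
        for _col_group in range(n // s):
            for _j_group in range(n // t):
                # s*t elements r_ij of R plus t*s elements s_jk of S
                phase1 += 2 * s * t
                # one partial sum for each of the s*s pairs (i, k)
                phase2 += s * s
    return phase1, phase2

n, s, t = 12, 3, 2
phase1, phase2 = two_phase_communication(n, s, t)
print(phase1, 2 * n**3 // s)
print(phase2, n**3 // t)
```

For $n = 12$, $s = 3$, $t = 2$ the enumerated totals agree with $2n^3/s$ and $n^3/t$.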

We must minimize this function subject to the constraint
that $2st = q$, where $q$ is the maximum number of inputs a
reducer in the first phase can receive. The reason for this
constraint is that such a reducer receives $r_{ij}$ for $s$ different
values of $i$ and $t$ different values of $j$, and it receives $s_{jk}$ for $s$
different values of $k$ and $t$ different values of $j$. The method
of Lagrangean multipliers lets us show that the minimum is
obtained when $s = 2t$. That is, $t = \sqrt{q}/2$ and $s = \sqrt{q}$.
With these values of $s$ and $t$, the total communication is

$$\frac{2n^3}{\sqrt{q}} + \frac{n^3}{\sqrt{q}/2} = \frac{4n^3}{\sqrt{q}}$$
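
The minimization can be written out as a short worked derivation. The following is one way to carry it out, with $\lambda$ introduced here as the multiplier for the constraint $2st = q$; it is a sketch of the standard calculation rather than a reproduction of the authors' argument.

```latex
\begin{align*}
L(s,t,\lambda) &= \frac{2n^3}{s} + \frac{n^3}{t} + \lambda\,(2st - q) \\
\frac{\partial L}{\partial s} = 0 &\;\Longrightarrow\; \lambda = \frac{n^3}{s^2 t} \\
\frac{\partial L}{\partial t} = 0 &\;\Longrightarrow\; \lambda = \frac{n^3}{2 s t^2} \\
\frac{n^3}{s^2 t} = \frac{n^3}{2 s t^2} &\;\Longrightarrow\; s = 2t \\
2st = q,\; s = 2t &\;\Longrightarrow\; t = \frac{\sqrt{q}}{2},\quad s = \sqrt{q} \\
\frac{2n^3}{s} + \frac{n^3}{t} &= \frac{2n^3}{\sqrt{q}} + \frac{2n^3}{\sqrt{q}} = \frac{4n^3}{\sqrt{q}}
\end{align*}
```

Equivalently, substituting $t = q/(2s)$ gives the single-variable cost $2n^3/s + 2n^3 s/q$, whose derivative vanishes at $s^2 = q$.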

On the other hand, the total communication for the
optimum one-phase method described in Section 6.2 is the
replication rate times the number of inputs, or

$$(2n^2/q) \times 2n^2 = 4n^4/q$$

For what values of $q$ does the one-phase method use less
communication than the two-phase method? Whenever

$$\frac{4n^4}{q} < \frac{4n^3}{\sqrt{q}}$$

or $q > n^2$. That is, for any number of reducers except 1, the
two-phase method uses less communication than the
one-phase method, and for small $q$ the two-phase approach uses
a lot less communication. There are other costs besides
communication, of course, but since both methods perform the
same arithmetic operations the same number of times, we
expect that in most situations, the communication difference
is decisive.
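
The crossover at $q = n^2$ can be seen numerically. The sketch below simply evaluates the two closed-form costs derived in this section; the function names are hypothetical, and the formulas are treated as continuous in $q$, ignoring the requirement that the tile sizes be integers dividing $n$.

```python
import math

def one_phase_cost(n, q):
    """Total communication of the one-phase method: 4n^4 / q."""
    return 4 * n**4 / q

def two_phase_cost(n, q):
    """Total communication of the two-phase method: 4n^3 / sqrt(q)."""
    return 4 * n**3 / math.sqrt(q)

n = 10
for q in (2 * n, 25, n * n, 2 * n * n):
    one = one_phase_cost(n, q)
    two = two_phase_cost(n, q)
    print(f"q={q:4d}  one-phase={one:10.1f}  two-phase={two:10.1f}")
```

Below $q = n^2$ the two-phase cost is the smaller of the two, and the ratio of the costs is $n/\sqrt{q}$, which is why the gap widens as $q$ shrinks.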

7. SUMMARY 

This paper has attempted to set a new direction for the 
study of optimal map-reduce algorithms. We introduced 
a simple model for map-reduce algorithms, enabling us to 
study their performance across a spectrum of possible com- 
puting clusters and computing-cluster properties such as 
communication speed and main-memory size. We identified 
replication rate and reducer input size as two parameters 
representing the communication cost and compute-node
capabilities, respectively, and we demonstrated that for a wide
variety of problems these two parameters are related by a
precise tradeoff formula. These problems include finding bit
strings at a fixed Hamming distance, finding triangles and
other fixed sample graphs in a larger data graph, computing
multiway joins, and matrix multiplication.

7.1 Open Problems 

The analyses done in this paper for several problems of
interest should be carried out for many other problems.
Discovering the tradeoff for Hamming distances greater than
1 seems hard. Analogous investigations are warranted for
other kinds of similarity joins besides those based on
Hamming distance. One question that arises naturally is how
closely the general lower bound on multiway joins derived
in this paper matches the general upper bounds in [1]. Since
there is no closed formula for either upper or lower bounds
in the general case, this question seems to need nontrivial
arguments in order to be answered.

Another interesting direction is to explore whether it is
possible to analyze algorithms taking two or more rounds of
map-reduce along the lines of Section 6.3. A possible first
place to look is at SQL statements that require two phases
of map-reduce, e.g., joins followed by aggregations.